Introduction

This document explores qualitative indicators from an ActivityInfo database that is monitoring Ecuador.

Indicator count totals
Nov 2013 to May 2019
Date        Quantity     Select      Single-line text   Multi-line text   Multi-line text as % of total data collected
Nov 2013    141,442      30,531      0                  6,309             3.54%
June 2015   1,887,857    745,841     85,863             57,128            2.06%
Sept 2016   3,380,991    1,296,548   191,640            116,184           2.33%
May 2017    4,932,977    1,809,419   265,196            168,599           2.35%
May 2019    12,174,327   7,595,829   2,683,945          915,948           3.92%



From the perspective of ActivityInfo, the table shows a clear need for new tools to support the analysis of qualitative data: the absolute volume of multi-line text has grown by a factor of roughly 145 between November 2013 and May 2019, and its share of all data collected has almost doubled since its low point in June 2015 (from 2.06% to 3.92%).
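As a quick check of these figures, the share of multi-line text and its growth can be recomputed from the counts in the table. This is a minimal base R sketch; the data frame and its column names are our own and are not taken from the project code.

```r
# Indicator counts copied from the table above.
counts <- data.frame(
  date        = c("Nov 2013", "June 2015", "Sept 2016", "May 2017", "May 2019"),
  quantity    = c(141442, 1887857, 3380991, 4932977, 12174327),
  select      = c(30531, 745841, 1296548, 1809419, 7595829),
  single_line = c(0, 85863, 191640, 265196, 2683945),
  multi_line  = c(6309, 57128, 116184, 168599, 915948)
)

# Share of multi-line text in all data collected, in percent.
total <- with(counts, quantity + select + single_line + multi_line)
counts$multi_line_pct <- round(100 * counts$multi_line / total, 2)

# Growth of the absolute multi-line volume, first to last snapshot.
growth_factor <- counts$multi_line[5] / counts$multi_line[1]

print(counts[, c("date", "multi_line", "multi_line_pct")])
print(round(growth_factor))
```

The recomputed percentages match the last column of the table, which suggests that the column reports multi-line text as a share of the four field types combined.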


Data preparation

The data has been extracted from ActivityInfo and pre-processed to make it ready for the analysis. Source the R/etl.R file to download the raw data.

Data import & preparation

Read the data that has been extracted, cleaned, and transformed. Select the rows where the field type equals NARRATIVE, which indicates a multi-line text field in ActivityInfo. Then analyze these records by comparing and contrasting them with the other field types associated with the text fields.
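The row selection can be sketched in base R as follows. The data frame, the column name `type` and every value other than NARRATIVE are invented for illustration; the real extract produced by R/etl.R may be structured differently.

```r
# Toy stand-in for the cleaned extract; column names and values are assumed.
records <- data.frame(
  question = c("Cualitativo", "Beneficiarios", "Cualitativo"),
  type     = c("NARRATIVE", "QUANTITY", "NARRATIVE"),
  response = c("Entrega de kits de salud", "25", "Se complementa la entrega"),
  stringsAsFactors = FALSE
)

# Keep only the multi-line text fields.
narratives <- records[records$type == "NARRATIVE", ]
print(nrow(narratives))
```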

Partners


The table below shows the number of records per partner:

  • ACNUR has 968 records, which is 62.0% of all records.

  • NRC comes second with 134 records, which is 8.6% of all records.

  • The gap between ACNUR and NRC is about 53 percentage points.

partnerName freq prop
ACNUR 968 0.620
NRC 134 0.086
PMA 106 0.068
UNICEF 84 0.054
OIM 73 0.047
UNFPA 64 0.041
CARE 29 0.019
Dialogo Diverso 26 0.017
Mision Scalabriniana 18 0.012
ADRA 15 0.010
RET 15 0.010
PNUD 7 0.004
JRS Ecuador 5 0.003
OPS/OMS 5 0.003
Plan Internacional 5 0.003
World Vision 5 0.003
UNESCO 3 0.002
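A frequency table of this kind can be produced with a few lines of base R. The sketch below uses a small invented vector of partner names, not the real records; only the output column names follow the table above.

```r
# Toy vector of partner names, one element per record (invented data).
partner <- c("ACNUR", "ACNUR", "ACNUR", "NRC", "PMA", "ACNUR", "NRC", "UNICEF")

# Count records per partner, most frequent first.
freq <- sort(table(partner), decreasing = TRUE)

partner_table <- data.frame(
  partnerName = names(freq),
  freq        = as.integer(freq),
  prop        = round(as.numeric(freq) / sum(freq), 3)
)
print(partner_table)
```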

The table below shows the proportion of records entered by partners and their sub-partners.

  • 676 of ACNUR’s 968 records actually come from HIAS.

  • UNICEF’s reporting is spread more evenly across sub-partners: 44% of its records come from HIAS, and 25% come from UNICEF itself.

  • PMA has 13 sub-partners. HIAS reports 41% of PMA’s records.

These are the totals for the whole database; they are not specific to the narratives (multi-line text fields). In the next section, we count only the records entered in narrative fields.


subPartnerName freq prop percent
ACNUR
HIAS 676 0.698 69%
ACNUR 284 0.293 29%
JRS Ecuador 5 0.005 0%
Federación de Mujeres de Sucumbios 2 0.002 0%
Federación de mujeres de Sucumbíos 1 0.001 0%
NRC
NRC 134 1.000 100%
OIM
OIM 73 1.000 100%
UNFPA
UNFPA 62 0.969 96%
RET 2 0.031 3%
PMA
HIAS 44 0.415 41%
ADRA 10 0.094 9%
Buen Pastor 5 0.047 4%
Fundación de Mujeres de Sucumbios 5 0.047 4%
Fundación Tarabita 5 0.047 4%
Hermanas Salesias 5 0.047 4%
Hogar de Cristo 5 0.047 4%
Pastoral Social Cáritas Tulcán 5 0.047 4%
SJR 5 0.047 4%
World Vision 5 0.047 4%
Alas de Colibri 4 0.038 3%
Casa Matilde 4 0.038 3%
Patronato 4 0.038 3%
UNICEF
HIAS 37 0.440 44%
ADRA 21 0.250 25%
UNICEF 21 0.250 25%
NRC 3 0.036 3%
Centro de Desarrollo y Autogestión 2 0.024 2%
CARE
CARE 29 1.000 100%
Dialogo Diverso
Dialogo Diverso 25 0.962 96%
OIM 1 0.038 3%
Mision Scalabriniana
Mision Scalabriniana 18 1.000 100%
ADRA
ADRA 15 1.000 100%
RET
RET 15 1.000 100%
PNUD
PNUD 7 1.000 100%
JRS Ecuador
JRS Ecuador 5 1.000 100%
OPS/OMS
OPS/OMS 5 1.000 100%
Plan Internacional
Plan Internacional 5 1.000 100%
World Vision
World Vision 5 1.000 100%
UNESCO
UNESCO 3 1.000 100%
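The within-partner proportions can be computed by counting each partner and sub-partner pair and dividing by the partner total. This is a base R sketch on invented records. Note one inference on our side: the percent column in the table appears to be truncated rather than rounded (0.969 is shown as 96%), so the sketch uses `floor()` to mimic that.

```r
# Toy records with a partner and the sub-partner that entered them (invented).
records <- data.frame(
  partnerName    = c("UNFPA", "UNFPA", "UNFPA", "UNFPA", "NRC", "NRC"),
  subPartnerName = c("UNFPA", "UNFPA", "UNFPA", "RET", "NRC", "NRC"),
  stringsAsFactors = FALSE
)

# Count records per partner / sub-partner pair.
counts <- aggregate(
  list(freq = rep(1L, nrow(records))),
  by  = records[c("partnerName", "subPartnerName")],
  FUN = sum
)

# Share of each sub-partner within its partner.
counts$prop <- counts$freq / ave(counts$freq, counts$partnerName, FUN = sum)

# Assumption: percentages are truncated, not rounded (0.969 -> 96%).
counts$percent <- paste0(floor(100 * counts$prop), "%")
print(counts)
```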

Which partners and sub-partners are reporting in all fields?

[Plots omitted, one per partner: ACNUR, ADRA, CARE, Dialogo Diverso, JRS Ecuador, Mision Scalabriniana, NRC, OIM, OPS/OMS, Plan Internacional, PMA, PNUD, RET, UNESCO, UNFPA, UNICEF, World Vision]


Narrative data

In this section, we focus on the subset of reports that contain multi-line text fields, called “narrative data” in ActivityInfo terms. Put plainly, narrative data comes from multi-line text fields that allow users to enter long texts.

The number of partners and sub-partners reporting narrative data

As we have seen previously, not all partners (and sub-partners) enter narrative records. For instance, the partner PMA has many sub-partners reporting other data types, but none of them report narratives.

In terms of narrative data,

TODO The partners X and Y is like that. The rest of the main partners, namely … do not have any subpartners reporting by them.

  • The following partners have no sub-partners reporting through them: CARE, Dialogo Diverso, JRS Ecuador, Mision Scalabriniana, NRC, OIM, OPS/OMS, Plan Internacional, PNUD, UNESCO, UNFPA

The number of cantons and provinces recording narrative data

Cantons and provinces
The number of reports in the multi-line text (narrative) fields
canton freq canton.prop province.prop
PICHINCHA
QUITO 126 1.000 0.190
CARCHI
TULCAN 101 1.000 0.153
SUCUMBIOS
LAGO AGRIO 89 1.000 0.134
IMBABURA
IBARRA 76 1.000 0.115
GUAYAS
GUAYAQUIL 55 1.000 0.083
EL ORO
HUAQUILLAS 49 0.860 0.074
MACHALA 8 0.140 0.012
ESMERALDAS
ESMERALDAS 38 0.551 0.057
SAN LORENZO 29 0.420 0.044
ELOY ALFARO 2 0.029 0.003
SANTO DOMINGO DE LOS TSACHILAS
SANTO DOMINGO 38 1.000 0.057
AZUAY
CUENCA 30 1.000 0.045
COTOPAXI
LATACUNGA 4 1.000 0.006
LOS RIOS
QUEVEDO 4 1.000 0.006
TUNGURAHUA
BAÑOS DE AGUA SANTA 4 0.571 0.006
AMBATO 3 0.429 0.005
CHIMBORAZO
RIOBAMBA 3 1.000 0.005
MANABI
MANTA 2 1.000 0.003
BOLIVAR
SAN MIGUEL 1 1.000 0.002



Treemap plot showing canton and province reporting frequencies.

Analysis

Label forms recode table

First of all, we recode the form topics into shorter names because the original labels are too long and clutter the plots. The recode table below provides a lookup between the form labels and their abbreviations:

labelFormsRecode labelForms
Salud Salud
Agua Agua, saneamiento e higiene
Alojamiento Alojamiento Temporal
Necesidades Necesidades básicas/Otro
Población Manejo de la información y entrega directa de la información a la población
Socios Manejo de la información para socios y análisis de las necesidades
VBG Protección_VBG
Tráfico Trata_y_tráfico
Educación Acceso_a_educación
Hábitat Acceso a vivienda y hábitat dignos en comunidades receptoras
Técnico Medios de vida y formación técnico-profesional
SocialCohesión Cohesión_social
Educacional Apoyo Educacional a Comunidades Receptoras
VBG_SSR Asistencia técnica para VBG-SSR
Fronteras Asistencia técnica para protección/gestión de fronteras
Coordinacion Asistencia técnica para gestion de la informacion y coordinacion
SectorLaboral Asistencia técnica para el sector laboral
Protección Asistencia técnica para protección
ProtecciónInfancia Asistencia técnica para protección de la infancia
LGBTI Protección_LGBTI
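A lookup like the one above can be applied with a named character vector. The sketch below includes only a few of the entries and assumes that labels without an entry should keep their original name; the variable names are ours.

```r
# A few entries from the lookup table above: full label -> abbreviation.
recode_lookup <- c(
  "Agua, saneamiento e higiene" = "Agua",
  "Alojamiento Temporal"        = "Alojamiento",
  "Protección_VBG"              = "VBG",
  "Cohesión_social"             = "SocialCohesión"
)

label_forms <- c("Protección_VBG", "Agua, saneamiento e higiene", "Salud")

# Look up the abbreviation; labels without an entry keep their original name.
label_forms_recode <- unname(recode_lookup[label_forms])
missing_entry <- is.na(label_forms_recode)
label_forms_recode[missing_entry] <- label_forms[missing_entry]
print(label_forms_recode)
```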

Response quality

Response quality refers to how much text the questions receive in response. The idea is to find the factors that affect response quality and to understand under which conditions the questions work well or poorly.

Research questions:

  • What is the quality of textual responses in the narrative fields?

  • Is there any relationship between the word counts of response, question and description fields?

  • How is the response word count distributed across explanatory variables such as the question, form topic, canton name, partner name, etc.?

Assumptions:

  • Responses with a larger word count are of higher quality than responses with a smaller word count.

In other words, we assume that the more words, the better. This assumption has limitations that stem from the unequal distribution of the data. Response word counts can also depend on other factors: for example, some questions require only short answers, so their responses tend to be shorter.

Additionally, we can cross-check these outcomes. It might be a good idea to take a small subset of the data and ask an expert to test the assumptions qualitatively. For instance, we can take the twenty responses with the highest word count and the twenty responses with the lowest word count. We choose the extremes because they show the greatest differences, which makes the assumptions easier to test.
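Drawing such a review sample could look like the following base R sketch. The data is simulated and the column names `id` and `wordCount` are placeholders; the sample size of twenty per extreme follows the suggestion in the text.

```r
# Simulated responses (invented); in practice this would be the narrative extract.
set.seed(1)
responses <- data.frame(
  id        = 1:100,
  wordCount = sample(1:300, 100, replace = TRUE)
)

# Sort by word count and take both extremes for expert review.
by_length <- responses[order(responses$wordCount), ]
shortest  <- head(by_length, 20)
longest   <- tail(by_length, 20)

review_sample <- rbind(shortest, longest)
print(nrow(review_sample))
```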

Word count

One issue with the questions is that their names are unique only within a form. The same question name can appear in multiple forms, and questions sharing a name can have different meanings. For instance, the question “Cualitativo” from the form “Salud” means something different from the question “Cualitativo” from the form “Protección_VBG”.

There are two ways to solve this problem:

  • We can combine the question with its form label and folder label, which gives each question a unique name.

  • Alternatively, we can move the analysis up to the form level. In this document we did both; the analysis is shown below:

Count of responses per topic/question:

labelForms question response .responseWordCount .questionWordCount partnerName canton description labelFormsRecode
Salud Cualitativo 1. Entrega de k 302 1 UNFPA TULCAN Descripción de Salud
Salud Cualitativo 1. Entrega de k 302 1 UNFPA HUAQUILLAS Descripción de Salud
Salud Cualitativo 1. Entrega de k 302 1 UNFPA MACHALA Descripción de Salud
Salud Cualitativo 1. Entrega de k 302 1 UNFPA LAGO AGRIO Descripción de Salud
Salud Cualitativo Se complementa 13 1 UNFPA LAGO AGRIO Descripción de Salud
Salud Cualitativo 233 Equipos méd 46 1 UNFPA SAN LORENZO Descripción de Salud
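The first option, building a unique question key, and the word count itself can be sketched in base R as follows. The rows are toy data (the folder `Objectivo_2.1` and the response texts are invented), and counting whitespace-separated pieces is our assumption about how the word count columns were derived.

```r
# Toy rows mirroring the columns used in this report (values partly invented).
records <- data.frame(
  labelFolder = c("Objectivo_1.1", "Objectivo_2.1"),
  labelForms  = c("Salud", "Protección_VBG"),
  question    = c("Cualitativo", "Cualitativo"),
  response    = c("Entrega de kits de salud", "Se complementa"),
  stringsAsFactors = FALSE
)

# Unique question key: folder, form and question joined together.
records$questionKey <- paste(
  records$labelFolder, records$labelForms, records$question, sep = " / "
)

# Word count: number of whitespace-separated pieces in the trimmed response.
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))
records$.responseWordCount <- count_words(records$response)

print(records[, c("questionKey", ".responseWordCount")])
```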

It is also good practice to look at the number of observations behind each box. For example, a question with only two responses produces a very short box. Jittered points are therefore added to the same plot to give a sense of the number of observations.

In the plot above, a box plot of response word counts per form topic based on the raw data, the outliers are shown in orange. Outliers are the points that lie beyond the whiskers, the long lines extending from each box.
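To make the outlier rule concrete, the sketch below flags outliers with `boxplot.stats()`, assuming the plot uses the conventional rule that whiskers extend at most 1.5 times the interquartile range beyond the box. The word counts are invented.

```r
# Toy word counts for one form topic (invented values).
word_count <- c(5, 8, 12, 13, 15, 18, 20, 22, 25, 302)

# boxplot.stats() applies the 1.5 * IQR whisker rule by default (coef = 1.5).
bp <- boxplot.stats(word_count)

print(bp$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$out)    # points beyond the whiskers
```

Here only the response with 302 words lies beyond the whiskers, so it would be the single highlighted point.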

The response word count distribution per form topic categorized by partner name:

The response word count distribution per form topic categorized by canton name:



A caveat: reducing multiple values to a single value should be avoided in the early stages of the analysis because the reduction hides a lot of information. An example is a bar chart showing the average word count per partner. Average word counts may differ between partners for two different reasons:

  1. Some partners actually write longer responses than others.

  2. The questions some partners answered require only short answers.

The Description field

Some questions have a description field that gives extra details about the question.

Do questions with a description field receive better responses than questions without one?

Looking at the table containing form name, question, description and so on:

The plot below shows the response word counts per form, colored by whether the question has a description field. A question is considered to have a description when its description field contains at least one word.
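The "at least one word" rule can be expressed as a simple logical flag. This is a base R sketch under the assumption that missing and whitespace-only descriptions count as no description; the variable names are ours.

```r
# Toy description values: a real text, an empty string, a missing value, blanks.
descriptions <- c("Descripción de procedimientos elaborados", "", NA, "   ")

# TRUE when the description holds at least one word after trimming whitespace.
has_description <- !is.na(descriptions) & nzchar(trimws(descriptions))
print(has_description)
```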


The responses with the highest word counts are the ones with a description. Nevertheless, it is hard to see a clear correlation between response word count and the presence of a description field. Interestingly, the form topic Protección_VBG has no description fields at all.

Analysis of Variance

TODO ANOVA

Correlation

TODO

The regression line

We can look at multiple continuous variables in our data.

  • word count of response field: the dependent variable.

  • word count of question field: an independent variable.

  • word count of description field: an independent variable.

Scatter plots help us understand the characteristics of those variables. However, they do not show the overall trend, which a regression line provides.

The gray area around the lines shows the confidence band at the 0.95 level. Although the linear regression line has a clear slope, we cannot say that the trend is robust, because the confidence band, which represents the uncertainty in the estimate, is wide.
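A regression line with its confidence band can be reproduced numerically with `lm()` and `predict()`. The sketch below runs on simulated word counts, so the coefficients say nothing about the real data; it only illustrates how the band is obtained and how its width can be inspected.

```r
# Simulated word counts (invented); the real analysis would use the narrative data.
set.seed(42)
question_wc <- sample(1:30, 60, replace = TRUE)
response_wc <- 20 + 2 * question_wc + rnorm(60, sd = 40)

fit <- lm(response_wc ~ question_wc)

# 95% confidence band for the fitted mean over a grid of question lengths.
grid <- data.frame(question_wc = seq(1, 30, by = 1))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
band_width <- band[, "upr"] - band[, "lwr"]

print(coef(fit))
print(range(band_width))
```

A wide band relative to the fitted values is the numeric counterpart of the wide gray area in the plot.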

Logistic regression

TODO

Text analysis

In this section, we treat text as data.

Textual data preparation

This section describes how to prepare textual data and which steps are commonly performed.

There are usually four steps involved in this process:

1. Tokenization

Tokenization means splitting a text into tokens, which are meaningful units of text. A token is most often a word, but it can also be a group of words (such as a bigram) or even a sentence, depending on the level of analysis.

labelFolder labelForms Month question description partnerName subPartnerName province canton labelFormsRecode word
Objectivo_1.1 Salud 2019-02 Cualitativo Descripción de procedimientos elaborados de referencia y contrareferencia de emergecisa obstétrica y neonatal, y otros problemas SSR UNFPA UNFPA CARCHI TULCAN Salud 1
Objectivo_1.1 Salud 2019-02 Cualitativo Descripción de procedimientos elaborados de referencia y contrareferencia de emergecisa obstétrica y neonatal, y otros problemas SSR UNFPA UNFPA CARCHI TULCAN Salud entrega
Objectivo_1.1 Salud 2019-02 Cualitativo Descripción de procedimientos elaborados de referencia y contrareferencia de emergecisa obstétrica y neonatal, y otros problemas SSR UNFPA UNFPA CARCHI TULCAN Salud de
Objectivo_1.1 Salud 2019-02 Cualitativo Descripción de procedimientos elaborados de referencia y contrareferencia de emergecisa obstétrica y neonatal, y otros problemas SSR UNFPA UNFPA CARCHI TULCAN Salud kits
Objectivo_1.1 Salud 2019-02 Cualitativo Descripción de procedimientos elaborados de referencia y contrareferencia de emergecisa obstétrica y neonatal, y otros problemas SSR UNFPA UNFPA CARCHI TULCAN Salud de
Objectivo_1.1 Salud 2019-02 Cualitativo Descripción de procedimientos elaborados de referencia y contrareferencia de emergecisa obstétrica y neonatal, y otros problemas SSR UNFPA UNFPA CARCHI TULCAN Salud salud

Stemming, which reduces inflected nouns and verbs to a common root form, can be performed together with tokenization, which splits the text into meaningful pieces.
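Word and bigram tokenization can be illustrated with base R alone. This is a simplified sketch that splits on whitespace only; the report itself may rely on a dedicated tokenizer, and the example sentence is a shortened, assumed version of a response.

```r
response <- "1. Entrega de kits de salud"

# Word tokens: split on whitespace.
words <- strsplit(response, "\\s+")[[1]]

# Bigram tokens: each pair of adjacent words.
bigrams <- paste(head(words, -1), tail(words, -1))

print(words)
print(bigrams)
```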

2. Strip punctuation

Punctuation is often not needed in text analysis and only adds noise, unless the researcher wants to tokenize the text on punctuation, for example into sentence tokens.

3. Convert text into lowercase

Once the text is converted into lowercase, the words respuesta and Respuesta, for instance, are no longer treated as different words.
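Steps 2 and 3 can be sketched together in base R. The tokens are invented examples, and the POSIX class `[[:punct:]]` is one possible definition of punctuation, not necessarily the one used in the report.

```r
tokens <- c("1.", "Entrega", "de", "kits,", "Respuesta", "respuesta")

# Remove punctuation characters, then convert to lowercase.
clean <- tolower(gsub("[[:punct:]]", "", tokens))
print(unique(clean))
```

After cleaning, Respuesta and respuesta collapse into a single token.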

4. Exclude stopwords & numbers

Stop words are the most common words in a language. They occur throughout the text and carry little meaning on their own, so they add nothing significant to the analysis. Stop words include articles (el/la), conjunctions (y), pronouns (yo/tú/etc.) and so on.

In text mining, stop words are usually removed after the text has been converted into lowercase, so that the stop word list does not have to include both lowercase and sentence-case versions.

We import a list of Spanish stop words (source here) and perform a filtering join that returns only the tokens not listed in the stop words.

It’s also possible to add more custom words such as ACNUR or HIAS if such organization names are not desired in the results.

The original token table had 37845 rows; after removing stop words, 19221 rows remain. That is a reduction of about 49% (51% of the tokens are kept).
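The filtering step and the size of the reduction can be sketched in base R. The stop word list here is a tiny invented stand-in for the imported Spanish list, and `%in%` plays the role of the filtering join.

```r
tokens <- c("1", "entrega", "de", "kits", "de", "salud", "y", "la", "respuesta")

# Tiny illustrative stop word list; the report imports a full Spanish list.
stopwords_es <- c("de", "y", "la", "el")

# Drop pure numbers and stop words.
is_number   <- grepl("^[0-9]+$", tokens)
is_stopword <- tokens %in% stopwords_es
kept <- tokens[!is_number & !is_stopword]
print(kept)

# Share of rows removed, using the row counts reported above.
removed_share <- 1 - 19221 / 37845
print(round(removed_share, 2))
```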

The most common words

The most common words in all responses:

The most common words per topic

[Plots omitted, one per form topic: Salud, Agua, Alojamiento, Necesidades, Población, Socios, VBG, Tráfico, Educación, Hábitat, Técnico, SocialCohesión, Educacional, VBG_SSR, Fronteras, Coordinacion, SectorLaboral, Protección, ProtecciónInfancia, LGBTI]


Sentiment analysis

Sentiment analysis (also called opinion mining) is a technique for gauging the emotional tone of a text using a dictionary (lexicon) in which humans have already labeled words as positive or negative.

The responses seem to be written with a formal tone of voice; therefore, the responses may not show any sentiment at all.

First, we find a sentiment lexicon for the Spanish language (source here).
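Matching tokens against a lexicon amounts to an inner join on the word column. The sketch below uses a hypothetical four-word lexicon and invented tokens purely for illustration; the labels do not come from the lexicon used in this report.

```r
# Hypothetical mini lexicon (invented); the report uses a published Spanish lexicon.
lexicon <- data.frame(
  word      = c("apoyo", "mejora", "riesgo", "violencia"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Toy tokens from responses (invented).
tokens <- data.frame(
  word = c("apoyo", "kits", "violencia", "apoyo", "salud", "riesgo"),
  stringsAsFactors = FALSE
)

# Inner join: only tokens present in the lexicon are kept.
scored <- merge(tokens, lexicon, by = "word")
sentiment_counts <- table(scored$sentiment)
print(sentiment_counts)
```

Tokens that are absent from the lexicon, such as kits and salud here, contribute nothing, which is one reason formally written responses may show little sentiment.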

A wordcloud showing positive and negative words:


 
The QualMiner project explores the qualitative data used for Venezuelan refugee response by applying text analysis & mining techniques. The project is funded by the UNHCR Innovation Fund.